Deep Mean Field Theory: Layerwise Variance
Abstract
A recent line of work has applied mean field theory to the statistical properties of neural networks with great success, making and verifying very precise predictions of neural network behavior and test-time performance. In this paper, we build on these works to explore two methods for taming the behavior of random residual networks (with only fully connected layers and no batchnorm). The first method is width variation (WV), i.e. varying the widths of layers as a function of depth. We show that width decay reduces gradient explosion without affecting the mean forward dynamics of the random network. The second method is variance variation (VV), i.e. changing the initialization variances of weights and biases over depth. We show that VV, used appropriately, can reduce the gradient explosion of tanh and ReLU resnets from exp(Θ(√L)) and exp(Θ(L)), respectively, to constant Θ(1). A complete phase diagram is derived for how variance decay affects different dynamics, such as those of gradient and activation norms. In particular, we show the existence of many phase transitions where these dynamics switch between exponential, polynomial, logarithmic, and even constant behaviors. Using the obtained mean field theory, we are able to track surprisingly well how VV at initialization time affects training and test performance on MNIST after a set number of epochs: the level sets of test/train set accuracies coincide with the level sets of the expectations of certain gradient norms or of metric expressivity (as defined in Yang and Schoenholz (2017)), a measure of expansion in a random neural network. Based on insights from past works in deep mean field theory and information geometry, we also provide a new perspective on the gradient explosion/vanishing problems: they lead to ill-conditioning of the Fisher information matrix, causing optimization troubles.
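The gradient-explosion rates quoted in the abstract can be probed numerically. The sketch below is illustrative only and not the paper's code: it pushes a random cotangent vector through the transposed Jacobians of a random fully connected ReLU resnet at initialization, comparing a constant weight variance against a simple 1/(l+1) variance decay. The function name `grad_growth`, the decay schedule, and the depth/width choices are ours; the paper derives the exact schedules that achieve Θ(1) gradients.

```python
# Illustrative sketch (assumed parameterization, not the paper's exact one):
# estimate gradient amplification through a random ReLU resnet
#   x_{l+1} = x_l + W_l relu(x_l),   W_l ~ N(0, sigma_l^2 / N),
# by propagating a random cotangent through J_l^T = I + D_l W_l^T,
# where D_l = diag(relu'(x_l)).
import numpy as np

def grad_growth(depth, width, sigma2_fn, seed=0):
    """Return ||g_0|| / ||g_L||, the total backward amplification."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    layers = []
    for l in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma2_fn(l) / width)
        layers.append((W, x > 0))          # store weights and ReLU mask D_l
        x = x + W @ np.maximum(x, 0.0)     # residual forward step
    g = rng.standard_normal(width)         # random cotangent at the output
    g0 = np.linalg.norm(g)
    for W, mask in reversed(layers):
        g = g + mask * (W.T @ g)           # apply J_l^T = I + D_l W_l^T
    return np.linalg.norm(g) / g0

L, N = 50, 256
const = grad_growth(L, N, lambda l: 1.0)            # constant sigma_l^2 = 1
decay = grad_growth(L, N, lambda l: 1.0 / (l + 1))  # decaying sigma_l^2
print(const, decay)
```

With a constant variance, the squared gradient norm grows by roughly a factor 1 + σ²/2 per layer (half the ReLU units are active on average), i.e. exponentially in L; the decaying schedule shrinks the per-layer factor toward 1, giving only mild growth.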
Similar resources
Multiscale Analysis of Transverse Cracking in Cross-Ply Laminated Beams Using the Layerwise Theory
A finite element model based on the layerwise theory is developed for the analysis of transverse cracking in cross-ply laminated beams. The numerical model is developed using the layerwise theory of Reddy, and the von Kármán type nonlinear strain field is adopted to accommodate the moderately large rotations of the beam. The finite element beam model is verified by comparing the present numeric...
Statestream: a Toolbox to Explore Layerwise-Parallel Deep Neural Networks
Building deep neural networks to control autonomous agents which have to interact in real-time with the physical world, such as robots or automotive vehicles, requires a seamless integration of time into a network's architecture. The central question of this work is how the temporal nature of reality should be reflected in the execution of a deep neural network and its components. Most artific...
Multi-Prediction Deep Boltzmann Machines
We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks ...
Weihrauch-completeness for layerwise computability
Layerwise computability is an effective counterpart to continuous functions that are almost-everywhere defined. This notion was introduced by Hoyrup and Rojas [17]. A function defined on Martin-Löf random inputs is called layerwise computable if it becomes computable once each input is equipped with some bound on the layer where it passes a fixed universal Martin-Löf test. Interesting examples of...
Local Behavior of Discretely Stiffened Composite Plates and Cylindrical Shells
The Layerwise Shell Theory is used to model discretely stiffened laminated composite plates and cylindrical shells for stress, vibration, pre-buckling and post-buckling analyses. The layerwise theory reduces a 3-D problem to a 2-D problem by expanding the 3-D displacement field as a function of a surface-wise 2-D displacement field and a 1-D interpolation polynomial through the shell thickness....
Publication date: 2018